1. Introduction

When learning a classifier from labeled training data, minimizing the probability 
of misclassification is often unsatisfactory. In a variety of applications, such as 
screening medical images for cancerous lesions or detecting landmines, false 
positives and negatives have different impacts. False detections of targets are 
problematic because of the time, money, and other resources which are invariably 
wasted as a result. Missed detections, on the other hand, may result in loss of 
life or destruction. For this reason, a number of methods for cost-sensitive [3, 12]
and Neyman-Pearson [6, 17, 18] classification have been developed that allow the
user to effect a tradeoff between false positive and negative rates.

The probability of error, false positive rate, and false negative rate are all 
performance measures that reflect the performance of a classifier on a single 
future test point. However, it is often the case that we desire to classify multiple 
future test points. In this situation, the false positive and negative rates may 
not be the most appropriate measures of performance. If a classifier has a false 
positive rate of say 5%, and 1000 negative test points (e.g., no target present) 
are observed, we expect 50 of them to be declared positive. This may be unac- 
ceptable, especially in situations where large costs are involved in investigating 
false positives. 

This situation is similar to the multiple testing problem in hypothesis testing. 
Consequently, many of the ideas from multiple testing are applicable in the 
classification setting. The basic approach is to consider alternative measures of 
size and power that are better suited to simultaneous inference, and to design 
decision rules based on these new performance measures. 

In this paper, we consider the false discovery rate (FDR) [20], which has 
emerged as the method of choice for quantifying error rates meaningfully in 
many multiple testing situations, with applications ranging from wavelet de- 
noising [8] to neuroimaging [13] to the analysis of DNA microarrays [10]. Con- 
trol of the FDR, i.e., the fraction of declared positives (discoveries) that are in 
fact negative, ensures that follow-up investigations into declared positives must 
return a certain yield of actual positives. Such control is vital in applications 
where follow-up studies are time or resource consuming. 

Several researchers, spurred by the seminal work of [4], have studied FDR 
control in the context of multiple hypothesis testing by assuming known dis- 
tributions of observed statistics under the null hypothesis. FDR control is then 
achieved, typically, by adjusting p-values through single-step, step-up, or step-down
procedures. It is important to note that such procedures are not applicable
in the statistical learning context because we do not assume knowledge of the 
null distribution and must instead rely upon training data. 

We develop basic results on the analysis of generalization error in FDR-controlled
classification, including uniform deviation bounds, finite sample performance
guarantees, and strong universal consistency. Unlike traditional performance
probabilities, whose empirical versions are related to binomial random
variables, empirical versions of FDR and of the false nondiscovery rate (FNDR)
are related to ratios of binomial variables. This necessitates the development
of novel concentration inequalities and methods of analysis.



1.1. Notation 



More formally, in this paper we consider the following scenario: Let $\mathcal{X}$ be a
set and $Z = (X, Y)$ be a random variable taking values in $\mathcal{Z} = \mathcal{X} \times \{0, 1\}$.
The variable $X$ corresponds to a pattern or feature vector and $Y$ to a class
label associated with $X$; $Y = 0$ corresponds to the null hypothesis (e.g., that
no target is present) and $Y = 1$ corresponds to the alternative hypothesis (e.g.,
that a target is present). The distribution on $\mathcal{Z}$ is unknown and is denoted by $P$.
Assume we make $n$ independent and identically distributed training observations
of $Z$, denoted $Z^n = (X_i, Y_i)_{i=1}^n$.

A classifier is a function $h : \mathcal{X} \to \{0, 1\}$ mapping feature vectors to class
labels. Let $\mathcal{H}$ denote a collection of different classifiers. A false discovery occurs
when $h(X) = 1$ but the true label is $Y = 0$. Similarly, a false nondiscovery
occurs when $h(X) = 0$ but $Y = 1$. We define the false discovery rate (FDR)

\[
\mathcal{R}_D(h) :=
\begin{cases}
P(Y = 0 \mid h(X) = 1), & \text{if } P(h(X) = 1) > 0, \\
\infty, & \text{else,}
\end{cases}
\]

and the false nondiscovery rate (FNDR)

\[
\mathcal{R}_{ND}(h) :=
\begin{cases}
P(Y = 1 \mid h(X) = 0), & \text{if } P(h(X) = 0) > 0, \\
\infty, & \text{else.}
\end{cases}
\]



1.2. Related Concepts 

These definitions, which are natural in the classification setting, coincide with 
the so-called positive FDR/FNDR of Storey [21,22], so named because it can 
be seen to equal the expected fraction of false discoveries/nondiscoveries, conditioned
on a positive number of discoveries/nondiscoveries having been made.
Storey makes some decision-theoretic connections to classification [22], but does 
not consider learning from data. 

Storey's definition does not cover the case where the conditioning event has 
probability zero. We define FDR and FNDR in these cases to be infinity. Our 
convention has the effect of assigning high costs to classifiers that fail to make at 
least some discoveries (and nondiscoveries). This is consistent with the multiple
testing perspective, where the goal is to generate interesting hypotheses for 
further examination. A classifier that makes no discoveries is of no use for such 
purposes. Further comments on the definition of FDR and FNDR are given after 
the proof of Theorem 2. 

In certain communities, different terms embody the idea behind FDR. In the
medical diagnostic testing literature, the positive predictive value (PPV) is defined
as the "proportion of patients with positive test results who are correctly
diagnosed" [1]. In database information retrieval problems, the precision is defined
as the ratio of the number of relevant documents retrieved by a search
to the total number of documents retrieved by a search [23]. Both PPV and
precision are equal to 1 - FDR. Precision is discussed further in Section 5.2.

Finally, several researchers have recently investigated connections between
multiple testing and statistical learning theory. McAllester's PAC-Bayesian learning
theory may be viewed as an extension of multiple testing procedures to (possibly
uncountably) infinite collections of hypotheses [16]. Blanchard and Fleuret
present an extension of the Occam's razor principle for generalization error analysis
in classification, and apply it to derive p-value adjustment procedures for
controlling FDR [5]. Arlot et al. develop concentration inequalities that apply to
multiple testing with correlated observations [2]. None of these works considers
FDR/FNDR as performance criteria for classification.

1.3. Connections to Cost-Sensitive Learning

In Sections 3 and 4 we consider the performance measure
$\mathcal{E}_\lambda(h) := \mathcal{R}_{ND}(h) + \lambda \mathcal{R}_D(h)$. It can be shown that the global
minimizers of this criterion have the form

\[
h(x) = \mathbf{1}_{\{\eta(x) \geq c\}} \tag{1}
\]

for some $c$, where $\eta(x) := P(Y = 1 \mid X = x)$ and, if necessary, this family
of classifiers is extended by a standard randomization argument if its receiver
operating characteristic (ROC) is not concave. Storey [22] gives a proof for the
case where the two class-conditional distributions are continuous. The classifiers
in (1) are also the optimal classifiers for Bayes cost-sensitive learning. That is,
they are also the minimizers of weighted Bayes costs of the form

\[
P(h(X) = 0, Y = 1) + \frac{1}{\gamma} P(h(X) = 1, Y = 0),
\]

$\gamma > 0$, where $c = 1/(1 + \gamma)$. Proof of this fact is a direct generalization of the
case of the probability of error, when $\gamma = 1$ [7].

Unfortunately, existing analyses for cost-sensitive classification cannot be
readily applied to our problem. Given $\lambda$, it is true that our criterion $\mathcal{E}_\lambda(h)$
can be minimized by performing cost-sensitive classification with a certain cost
parameter $\gamma$. The critical issue is that $\gamma$ is an implicit function of $\lambda$, and cannot
be determined a priori without knowledge of the underlying distribution.
Therefore, when only data are given, applying existing cost-sensitive classification
methods to our problem would require estimating $\gamma$. In practice, this would
most likely entail learning cost-sensitive classifiers $\hat{h}_{\gamma_i}$ for some grid of values
$\{\gamma_i\}$ that grows increasingly dense as $n \to \infty$. Then, the best of these candidates
would be selected by minimizing an estimate of $\mathcal{E}_\lambda(h)$. Such a procedure would
likely be expensive computationally. From an analytical standpoint, it seems
plausible that generalization error analyses for cost-sensitive classification could
be useful; however, the need to search for a $\gamma$ that approximately minimizes our
criterion would significantly complicate the analysis. The objective of our work
is to develop a much more direct approach, which does not require repeated
cost-sensitive classification.

Therefore, the distinction between our problem and cost-sensitive classifica- 
tion is in some ways analogous to the difference between the Neyman-Pearson 
and Bayesian theories of hypothesis testing. Even though these two problems
have a likelihood ratio as their optimal solution, the specific thresholds for the 
likelihood ratios are determined in very different ways depending on which cri- 
terion is employed. In our setting, the differences are further compounded by 
the fact that we are learning from data. 
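
To see where a threshold of the form $c = 1/(1 + \gamma)$ comes from, the following calculation sketches the pointwise minimization of a weighted Bayes cost. The weights $a$ and $b$ are hypothetical symbols introduced only for this illustration; the ratio $\gamma := a/b$ is assumed to play the role of the cost parameter, and $\eta(x) = P(Y = 1 \mid X = x)$ as before.

```latex
% Sketch: pointwise minimization of a weighted Bayes cost.
% The weights a, b > 0 are hypothetical and not part of the original text.
\begin{align*}
C_{a,b}(h) &:= a\,P(h(X) = 0,\, Y = 1) + b\,P(h(X) = 1,\, Y = 0) \\
           &= \mathbb{E}\!\left[\, a\,\eta(X)\,\mathbf{1}_{\{h(X) = 0\}}
              + b\,(1 - \eta(X))\,\mathbf{1}_{\{h(X) = 1\}} \right].
\end{align*}
% At each x, predicting 1 costs b(1 - \eta(x)) and predicting 0 costs a\,\eta(x), so
\[
h(x) = 1 \text{ is optimal}
\iff b\,(1 - \eta(x)) \leq a\,\eta(x)
\iff \eta(x) \geq \frac{b}{a + b} = \frac{1}{1 + \gamma},
\qquad \gamma := a/b .
\]
```

With $\gamma = 1$ the threshold is $1/2$, which is the familiar rule for minimizing the probability of error.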

1.4. Overview

In the next section we present and prove uniform deviation bounds for FDR
and FNDR. In Section 3, we discuss performance measures based on FDR and
FNDR, and in Section 4 we establish the strong universal consistency of a learning
rule with respect to the measure $\mathcal{E}_\lambda$. Section 5 treats performance measures
which constrain FDR, and the final section offers a concluding discussion. Several
aspects of our analysis deviate from standard techniques, a consequence of
certain unique features of FDR and FNDR, and we highlight these throughout
the paper.

2. Uniform Deviation Bounds 

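Define empirical analogues to the FDR and FNDR according to

\[
\hat{\mathcal{R}}_D(h) := \frac{1}{n_D(h, Z^n)} \sum_{i=1}^n \mathbf{1}_{\{h(X_i) = 1,\, Y_i = 0\}}
\quad \text{if } n_D(h, Z^n) > 0,
\]
\[
\hat{\mathcal{R}}_{ND}(h) := \frac{1}{n_{ND}(h, Z^n)} \sum_{i=1}^n \mathbf{1}_{\{h(X_i) = 0,\, Y_i = 1\}}
\quad \text{if } n_{ND}(h, Z^n) > 0,
\]

with a separate convention in the degenerate cases $n_D(h, Z^n) = 0$ and
$n_{ND}(h, Z^n) = 0$, respectively. Here
$n_D(h, Z^n) = \sum_{i=1}^n \mathbf{1}_{\{h(X_i) = 1\}}$ and
$n_{ND}(h, Z^n) = \sum_{i=1}^n \mathbf{1}_{\{h(X_i) = 0\}}$ are binomial
random variables. This section describes a uniform bound on the amount
by which the empirical estimate of FDR/FNDR can deviate from the true value.
Note that unlike the usual empirical estimates for the probability of error/false
positive rate/false negative rate, here both numerator and denominator are random,
and both depend on $h$.

Assume $\mathcal{H}$ is countable, and let $\llbracket \cdot \rrbracket$ be a real-valued functional on $\mathcal{H}$ such
that $\sum_{h \in \mathcal{H}} 2^{-\llbracket h \rrbracket} \leq 1$. Such a functional can be identified with a prefix code for
$\mathcal{H}$, in which case $\llbracket h \rrbracket$ is the codelength associated to $h$. If
$\sum_{h \in \mathcal{H}} 2^{-\llbracket h \rrbracket} = 1$, then $2^{-\llbracket h \rrbracket}$ may be viewed as a prior distribution on $\mathcal{H}$.

imsart-ejs ver. 2008/08/29 file: ejs_2009_363.tex date: January 27, 2009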
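
As a concrete illustration of the empirical quantities, the following Python sketch counts discoveries and nondiscoveries for a vector of 0/1 predictions and forms the two ratios. The function name `empirical_fdr_fndr` and the toy data are illustrative only, and returning infinity when a denominator is zero is an assumption made here by analogy with the convention for the true FDR and FNDR.

```python
import math
from typing import Sequence, Tuple


def empirical_fdr_fndr(preds: Sequence[int], labels: Sequence[int]) -> Tuple[float, float]:
    """Return (empirical FDR, empirical FNDR) for 0/1 predictions and 0/1 labels."""
    if len(preds) != len(labels):
        raise ValueError("preds and labels must have the same length")
    n_d = sum(1 for p in preds if p == 1)   # number of discoveries, h(X_i) = 1
    n_nd = len(preds) - n_d                 # number of nondiscoveries, h(X_i) = 0
    false_d = sum(1 for p, y in zip(preds, labels) if p == 1 and y == 0)
    false_nd = sum(1 for p, y in zip(preds, labels) if p == 0 and y == 1)
    # Assumption of this sketch: use infinity when the denominator is zero.
    fdr = false_d / n_d if n_d > 0 else math.inf
    fndr = false_nd / n_nd if n_nd > 0 else math.inf
    return fdr, fndr


preds = [1, 1, 1, 0, 0, 0, 0, 1]
labels = [1, 0, 1, 0, 1, 0, 0, 1]
fdr, fndr = empirical_fdr_fndr(preds, labels)
print(fdr, fndr)  # 0.25 0.25
```

Both the numerator and the denominator of each ratio change with the classifier, which is the feature that separates these estimates from the usual empirical error rates.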



Scott et al./FDR in Statistical Pattern Classification 
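For $\delta > 0$, we introduce the penalty terms

\[
\phi_D(h, \delta) :=
\begin{cases}
\sqrt{\dfrac{\llbracket h \rrbracket \log 2 + \log(2/\delta)}{2\, n_D(h, Z^n)}}, & n_D(h, Z^n) > 0, \\
\infty, & n_D(h, Z^n) = 0,
\end{cases}
\]

\[
\phi_{ND}(h, \delta) :=
\begin{cases}
\sqrt{\dfrac{\llbracket h \rrbracket \log 2 + \log(2/\delta)}{2\, n_{ND}(h, Z^n)}}, & n_{ND}(h, Z^n) > 0, \\
\infty, & n_{ND}(h, Z^n) = 0.
\end{cases}
\]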



The interpretation of these expressions as penalties comes from the learning 
algorithms studied below, where we minimize the empirical error plus a penalty 
to avoid overfitting. Note that the penalties are data dependent. 
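
The penalty is simple to evaluate once the codelength and the number of discoveries (or nondiscoveries) are known. The sketch below assumes the Hoeffding-type form sqrt((c log 2 + log(2/delta)) / (2k)), where c is the codelength of h in bits and k is the relevant count; the function name `penalty` and its argument names are illustrative.

```python
import math


def penalty(codelength_bits: float, delta: float, count: int) -> float:
    """Data-dependent penalty: infinite for a zero count, otherwise a Hoeffding-type term."""
    if delta <= 0:
        raise ValueError("delta must be positive")
    if count < 0:
        raise ValueError("count must be nonnegative")
    if count == 0:
        return math.inf  # no discoveries (or nondiscoveries): the bound is vacuous
    numerator = codelength_bits * math.log(2) + math.log(2 / delta)
    return math.sqrt(numerator / (2 * count))


# The penalty shrinks as the count grows and grows with the codelength.
print(penalty(10, 0.05, 50) > penalty(10, 0.05, 200))   # True
print(penalty(20, 0.05, 200) > penalty(10, 0.05, 200))  # True
```

Because the count in the denominator is the number of discoveries rather than the sample size, a classifier that declares very few positives receives a large penalty even when n is large.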

Theorem 1. With probability at least $1 - \delta$ with respect to the draw of the
training data,

\[
|\hat{\mathcal{R}}_D(h) - \mathcal{R}_D(h)| \leq \phi_D(h, \delta) \tag{2}
\]

for all $h \in \mathcal{H}$. Similarly, with probability at least $1 - \delta$ with respect to the draw
of the training data,

\[
|\hat{\mathcal{R}}_{ND}(h) - \mathcal{R}_{ND}(h)| \leq \phi_{ND}(h, \delta) \tag{3}
\]

for all $h \in \mathcal{H}$. The results are independent of the underlying probability
distribution.

Because of the form of the penalty terms, the bound is larger for classifiers h 
that are more complex, as represented through the codelength $\llbracket h \rrbracket$, and smaller
when more discoveries/nondiscoveries are made. This result leads to finite sam- 
ple bounds and strong universal consistency for certain learning rules based on 
minimization of the penalized empirical error, as developed in the sequel. 

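Proof. We prove the first statement, the second being similar. For added clarity,
write the penalty as $\phi_D(h, \delta, n_D(h, Z^n))$, where

\[
\phi_D(h, \delta, k) :=
\begin{cases}
\sqrt{\dfrac{\llbracket h \rrbracket \log 2 + \log(2/\delta)}{2k}}, & k > 0, \\
\infty, & k = 0.
\end{cases}
\]

Consider a fixed $h \in \mathcal{H}$. The fundamental concentration inequality underlying
our bounds is Hoeffding's [14], which, in one form, states that if $S_k$ is the sum
of $k > 0$ independent random variables bounded between zero and one, and
$\mu = E[S_k]$, then

\[
P(|\mu - S_k| > k\epsilon) \leq 2 e^{-2k\epsilon^2}.
\]

To apply Hoeffding's inequality, we need the following conditioning argument.
Let $V = (V_1, \ldots, V_n) \in \{0, 1\}^n$ be a binary indicator vector, with
$V_i = \mathbf{1}_{\{h(X_i) = 1\}}$. Let $\mathcal{V}_k$ denote the set of all $v = (v_1, \ldots, v_n) \in \{0, 1\}^n$ such
that $\sum_{i=1}^n v_i = k$. We may then write

\begin{align*}
& P\big(|\hat{\mathcal{R}}_D(h) - \mathcal{R}_D(h)| > \phi_D(h, \delta, n_D(h, Z^n))\big) \\
&\quad = \sum_{k=0}^n \sum_{v \in \mathcal{V}_k} P\big(|\hat{\mathcal{R}}_D(h) - \mathcal{R}_D(h)| > \phi_D(h, \delta, k) \,\big|\, V = v\big)\, P(V = v) \\
&\quad = \sum_{k=0}^n \sum_{v \in \mathcal{V}_k} P\big(|k \hat{\mathcal{R}}_D(h) - k \mathcal{R}_D(h)| > k \phi_D(h, \delta, k) \,\big|\, V = v\big)\, P(V = v).
\end{align*}

First note that $|\hat{\mathcal{R}}_D(h) - \mathcal{R}_D(h)| \leq \phi_D(h, \delta, 0) = \infty$ with probability one when
$n_D(h, Z^n) = 0$. We now apply Hoeffding's inequality for each $k \geq 1$ and $v \in \mathcal{V}_k$,
conditioning on $V = v$. Setting $S_k = k \hat{\mathcal{R}}_D(h)$, we have

\begin{align*}
\mu &= E[S_k \mid V = v] \\
    &= k\, E[\hat{\mathcal{R}}_D(h) \mid V = v] \\
    &= E\Big[\sum_{i=1}^n \mathbf{1}_{\{h(X_i) = 1,\, Y_i = 0\}} \,\Big|\, V = v\Big] \\
    &= E\Big[\sum_{i : v_i = 1} \mathbf{1}_{\{Y_i = 0\}} \,\Big|\, V = v\Big] \\
    &= k\, P(Y = 0 \mid h(X) = 1) \\
    &= k\, \mathcal{R}_D(h),
\end{align*}

where in the next to last step we use independence of the realizations. Therefore,
applying Hoeffding's inequality conditioned on $V = v \in \mathcal{V}_k$ yields

\begin{align*}
P\big(|\hat{\mathcal{R}}_D(h) - \mathcal{R}_D(h)| > \phi_D(h, \delta, n_D(h, Z^n))\big)
&\leq \sum_{k=1}^n \sum_{v \in \mathcal{V}_k} \delta\, 2^{-\llbracket h \rrbracket}\, P(V = v) \\
&= \delta\, 2^{-\llbracket h \rrbracket} \Big(1 - P\Big(\sum_{i=1}^n V_i = 0\Big)\Big) \leq \delta\, 2^{-\llbracket h \rrbracket}.
\end{align*}

The result now follows by applying the union bound over all $h \in \mathcal{H}$. □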
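
The form of the penalty can be read off from Hoeffding's inequality. The short calculation below is a sketch added for illustration, not part of the original argument, and writes $\llbracket h \rrbracket$ for the codelength of $h$. It shows that the penalty is the value of $\epsilon$ at which the Hoeffding tail for $k$ discoveries equals $\delta\, 2^{-\llbracket h \rrbracket}$, so that the union bound over the class sums to at most $\delta$ by the Kraft-type condition on the codelengths.

```latex
\begin{align*}
2 e^{-2k\epsilon^2} = \delta\, 2^{-\llbracket h \rrbracket}
&\iff 2k\epsilon^2 = \llbracket h \rrbracket \log 2 + \log(2/\delta) \\
&\iff \epsilon = \sqrt{\frac{\llbracket h \rrbracket \log 2 + \log(2/\delta)}{2k}}, \\[4pt]
\sum_{h \in \mathcal{H}} \delta\, 2^{-\llbracket h \rrbracket} &\leq \delta .
\end{align*}
```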

The technique of conditioning on the random denominator of a ratio of binomials
has also been applied in other settings [15, 18]. Unlike those works, however,
here the binomial denominator depends on the classifier $h$. This presents
difficulties for extending the above techniques to uncountable classes $\mathcal{H}$. See the
final section for further discussion of this issue.

3. Measuring Performance 

We would like to be able to make FDR/FNDR related guarantees about how
a data-based classifier $h$ performs. For this, we need to specify a performance
measure or optimality criterion that incorporates both FDR and FNDR quantities
simultaneously. One possibility is to specify a number $0 < \alpha < 1$ and seek
the classifier such that $\mathcal{R}_{ND}(h)$ is minimal while $\mathcal{R}_D(h) \leq \alpha$. We consider this
setting in Section 5. Another is to specify a constant $\lambda > 0$ reflecting the relative
cost of FDR to FNDR, and minimize

\[
\mathcal{E}_\lambda(h) := \mathcal{R}_{ND}(h) + \lambda \mathcal{R}_D(h).
\]

This measure was introduced by Storey [22], but was not studied in a learning
context. The uniform deviation bounds of the previous section immediately
imply the following computable bound on a classifier's performance with respect
to this measure.

Corollary 1. For any $\delta > 0$ and $n \geq 1$, with probability at least $1 - 2\delta$ with
respect to the draw of the training data,

\[
\mathcal{E}_\lambda(h) \leq \hat{\mathcal{R}}_{ND}(h) + \phi_{ND}(h, \delta) + \lambda \big[\hat{\mathcal{R}}_D(h) + \phi_D(h, \delta)\big]
\]

for all $h \in \mathcal{H}$.

In the next section, we analyze a learning rule based on minimizing the bound 
of Corollary 1, and establish its strong universal consistency. 

4. Strong Universal Consistency 

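Denote the globally optimal value of the performance measure by

\[
\mathcal{E}_\lambda^* := \inf_h \mathcal{E}_\lambda(h),
\]

where the inf is over all measurable $h : \mathcal{X} \to \{0, 1\}$. We seek a learning rule $\hat{h}_{\lambda,n}$
such that $\mathcal{E}_\lambda(\hat{h}_{\lambda,n}) \to \mathcal{E}_\lambda^*$ almost surely, regardless of the underlying probability
distribution. Thus let $\{\mathcal{H}_k\}_{k \geq 1}$ be a family of finite sets of classifiers with universal
approximation capability. That is, assume that
$\lim_{k \to \infty} \inf_{h \in \mathcal{H}_k} \mathcal{E}_\lambda(h) = \mathcal{E}_\lambda^*$
for all distributions on $(X, Y)$. Furthermore, assume this family to be nested,
meaning $\mathcal{H}_1 \subset \mathcal{H}_2 \subset \mathcal{H}_3 \cdots$. For example, if $\mathcal{X} = [0, 1]^d$, we may take $\mathcal{H}_k$ to be
the collection of histogram classifiers based on a binwidth of $2^{-k}$. Recall that
we can set $\llbracket h \rrbracket = \log_2 |\mathcal{H}_k|$ for $h \in \mathcal{H}_k$, where $|\mathcal{H}_k|$ is the cardinality of $\mathcal{H}_k$. For
histograms, we have $|\mathcal{H}_k| = 2^{2^{kd}}$ and hence $\log |\mathcal{H}_k| = 2^{kd} \log 2$.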
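
The histogram example can be made concrete with a few lines of Python. The sketch below assumes the feature space is the unit cube, partitions it into cells of side 2^-k, and represents a classifier as a mapping from cell indices to labels; the names `cell_index`, `histogram_classify`, and `codelength_bits` are illustrative, and the default label 0 for unlisted cells is an assumption of this sketch.

```python
from typing import Dict, Sequence, Tuple

Cell = Tuple[int, ...]


def cell_index(x: Sequence[float], k: int) -> Cell:
    """Index of the cell of side 2**-k in [0, 1]^d that contains x."""
    m = 2 ** k  # number of bins along each coordinate
    # A coordinate equal to 1.0 is assigned to the last bin.
    return tuple(min(int(xi * m), m - 1) for xi in x)


def histogram_classify(x: Sequence[float], k: int, cell_labels: Dict[Cell, int]) -> int:
    """Label of the cell containing x; unlisted cells default to 0 (an assumption)."""
    return cell_labels.get(cell_index(x, k), 0)


def codelength_bits(k: int, d: int) -> int:
    """log2 of the number of histogram classifiers: one bit per cell, 2**(k*d) cells."""
    return 2 ** (k * d)


cell_labels = {(1, 3): 1}
print(cell_index((0.3, 0.9), 2))                       # (1, 3)
print(histogram_classify((0.3, 0.9), 2, cell_labels))  # 1
print(codelength_bits(2, 2))                           # 16
```

Each of the 2^(kd) cells carries one binary label, which is why every classifier at resolution k can be assigned the same codelength of 2^(kd) bits.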

The bound of Corollary 1 suggests bound minimization as a strategy for selecting
a classifier empirically. However, rather than minimizing over all possible
classifiers in some $\mathcal{H}_k$, we first discard those classifiers whose empirical numbers
of discoveries or nondiscoveries are too small. In these cases, the penalties
are possibly quite large, and we are unable to obtain tight concentrations of
empirical FDR/FNDR measures around their true values. However, as $n$ increases,
we are able to admit classifiers with increasingly small proportions of
(non)discoveries, so that in the limit, we can still approximate arbitrary
distributions. This aspect is another unique feature of FDR/FNDR compared to
traditional performance measures.

Formally, set $\delta_n = n^{-2}$ and define

\[
\hat{\mathcal{H}}_n := \Big\{ h \in \mathcal{H}_{k_n} : \frac{n_D(h, Z^n)}{n} \geq p_n, \ \frac{n_{ND}(h, Z^n)}{n} \geq p_n \Big\},
\]

where $p_n := (\log n)^{-1}$. Here $k_n$ is such that $k_n \to \infty$ as $n \to \infty$ and
$\log |\mathcal{H}_{k_n}| = o(n / \log n)$. For the histogram example, $\log |\mathcal{H}_{k_n}| = 2^{k_n d} \log 2$, and thus the
assumed conditions on the growth of $k_n$ are essentially the same (up to a logarithmic
factor) as for consistency of histograms in other problems. For example,
in standard classification, $2^{k_n d} = o(n)$ is required [7].

Denote the bound of Corollary 1 by

\[
\hat{\mathcal{E}}_\lambda(h) := \hat{\mathcal{R}}_{ND}(h) + \phi_{ND}(h, \delta_n) + \lambda \big[\hat{\mathcal{R}}_D(h) + \phi_D(h, \delta_n)\big],
\]

and define the classification rule

\[
\hat{h}_{\lambda,n} := \arg\min_{h \in \hat{\mathcal{H}}_n} \hat{\mathcal{E}}_\lambda(h).
\]

If $\hat{\mathcal{H}}_n = \emptyset$, then $\hat{h}_{\lambda,n}$ may be defined arbitrarily.
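
The rule just defined can be sketched in Python for a finite list of candidates. The sketch assumes each candidate is supplied as its 0/1 predictions on the training sample together with a codelength in bits, takes delta_n = 1/n^2 and p_n = 1/log n, and returns the index of the selected candidate (or `None` when every candidate is discarded). The names `select_classifier`, `penalty`, and `Candidate` are illustrative, not part of the original text.

```python
import math
from typing import Optional, Sequence, Tuple

# (0/1 predictions on the training sample, codelength in bits)
Candidate = Tuple[Sequence[int], float]


def penalty(bits: float, delta: float, count: int) -> float:
    """Hoeffding-type penalty; infinite when the count is zero."""
    if count == 0:
        return math.inf
    return math.sqrt((bits * math.log(2) + math.log(2 / delta)) / (2 * count))


def select_classifier(candidates: Sequence[Candidate],
                      labels: Sequence[int],
                      lam: float) -> Optional[int]:
    """Minimize the penalized bound over candidates with enough (non)discoveries."""
    n = len(labels)
    if n < 2:
        return None  # log(1) = 0, so p_n is undefined; nothing is selected
    delta_n = 1.0 / n ** 2
    p_n = 1.0 / math.log(n)
    best_idx, best_val = None, math.inf
    for idx, (preds, bits) in enumerate(candidates):
        n_d = sum(preds)
        n_nd = n - n_d
        # Discard classifiers with too few discoveries or nondiscoveries.
        if n_d / n < p_n or n_nd / n < p_n:
            continue
        fdr = sum(1 for p, y in zip(preds, labels) if p == 1 and y == 0) / n_d
        fndr = sum(1 for p, y in zip(preds, labels) if p == 0 and y == 1) / n_nd
        val = (fndr + penalty(bits, delta_n, n_nd)
               + lam * (fdr + penalty(bits, delta_n, n_d)))
        if val < best_val:
            best_idx, best_val = idx, val
    return best_idx


labels = [1] * 10 + [0] * 10
candidates = [
    ([1] * 2 + [0] * 18, 4.0),   # too few discoveries: discarded
    ([1] * 8 + [0] * 12, 4.0),   # misses two positives
    (list(labels), 4.0),         # agrees with every label
]
print(select_classifier(candidates, labels, lam=1.0))  # 2
```

For small n the threshold p_n is close to 1/2, so almost every candidate is discarded; the filter only becomes permissive as n grows, mirroring the discussion above.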

Theorem 2. For any distribution on $(X, Y)$, and any $\lambda > 0$,

\[
\mathcal{E}_\lambda(\hat{h}_{\lambda,n}) \to \mathcal{E}_\lambda^*
\]

almost surely. That is, $\hat{h}_{\lambda,n}$ is strongly universally consistent.

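Proof. First consider the case where there is no measurable $h : \mathcal{X} \to \{0, 1\}$ such
that both $P(h(X) = 0) > 0$ and $P(h(X) = 1) > 0$. This occurs when $X$ is
deterministic. Then $\mathcal{E}_\lambda^* = \infty$, and trivially $\hat{h}_{\lambda,n}$ achieves optimal performance.
So assume this is not the case.

By the Borel-Cantelli lemma [7, 9], it suffices to show that for each $\epsilon > 0$

\[
\sum_{n=1}^\infty P(\Omega^n) < \infty,
\]

where

\[
\Omega^n := \{Z^n : \mathcal{E}_\lambda(\hat{h}_{\lambda,n}) - \mathcal{E}_\lambda^* > \epsilon\}.
\]

Introduce the event

\[
\Theta^n := \{Z^n : \hat{\mathcal{H}}_n \neq \emptyset\}.
\]

Then

\[
P(\Omega^n) = P(\Omega^n \mid \Theta^n) P(\Theta^n) + P(\Omega^n \mid \overline{\Theta^n}) P(\overline{\Theta^n})
\leq P(\Omega^n \mid \Theta^n) + P(\overline{\Theta^n}),
\]

and therefore

\[
\sum_{n=1}^\infty P(\Omega^n) \leq \sum_{n=1}^\infty P(\Omega^n \mid \Theta^n) + \sum_{n=1}^\infty P(\overline{\Theta^n}). \tag{4}
\]

We will bound these two terms separately.
Consider the second term.

Lemma 1. Let $\nu > 0$ and assume $\mathcal{E}_\lambda^* < \infty$. There exist $h'$ and $N_1$ such that
$\mathcal{E}_\lambda(h') \leq \mathcal{E}_\lambda^* + \nu$ and, for all $n \geq N_1$, $P(h' \in \hat{\mathcal{H}}_n) \geq 1 - 1/n^2$.

Proof. By the universal approximation assumption, there exists $m$ and $h' \in \mathcal{H}_{k_m}$
such that $\mathcal{E}_\lambda(h') \leq \mathcal{E}_\lambda^* + \nu$. Since $\mathcal{E}_\lambda^* < \infty$, this $h'$ necessarily has both
$P(h'(X) = 0) > 0$ and $P(h'(X) = 1) > 0$. Denote
$q := \min\{P(h'(X) = 1), P(h'(X) = 0)\} > 0$. Introduce

\[
\epsilon_n := \sqrt{\frac{\log(2/\delta_n)}{2n}}.
\]

By Hoeffding's inequality, with probability at least $1 - \delta_n$,
$|P(h'(X) = 1) - n_D(h', Z^n)/n| = |P(h'(X) = 0) - n_{ND}(h', Z^n)/n| \leq \epsilon_n$.
Since $\delta_n = 1/n^2$, we have that $\epsilon_n = o(p_n)$. Now choose $N_1$ such that
$\epsilon_{N_1} \leq p_{N_1}$ and $2 p_{N_1} \leq q$. Then, for a sample of size $n = N_1$,
$\min\{n_D(h', Z^n)/N_1, n_{ND}(h', Z^n)/N_1\} \geq q - \epsilon_{N_1} \geq 2 p_{N_1} - \epsilon_{N_1} \geq p_{N_1}$
with probability at least $1 - \delta_{N_1} = 1 - 1/N_1^2$. Since $p_n$ is
decreasing and $\{\mathcal{H}_k\}$ is nested, the same is true for all $n \geq N_1$. □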

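By this lemma we have $P(\overline{\Theta^n}) \leq \delta_n = 1/n^2$ for all $n \geq N_1$. (Here we only
need the second part of the conclusion of the lemma; later we use the lemma in
its full generality.) Thus

\[
\sum_{n=1}^\infty P(\overline{\Theta^n}) \leq N_1 + \sum_{n > N_1} \frac{1}{n^2} < \infty.
\]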

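Now consider the first term on the right-hand side of (4). Define the events

\begin{align*}
\Omega_1^n &:= \Big\{Z^n : \mathcal{E}_\lambda(\hat{h}_{\lambda,n}) - \inf_{h \in \hat{\mathcal{H}}_n} \mathcal{E}_\lambda(h) > \frac{\epsilon}{2}\Big\}, \\
\Omega_2^n &:= \Big\{Z^n : \inf_{h \in \hat{\mathcal{H}}_n} \mathcal{E}_\lambda(h) - \mathcal{E}_\lambda^* > \frac{\epsilon}{2}\Big\}.
\end{align*}

Since $\Omega^n \subset \Omega_1^n \cup \Omega_2^n$, we have

\[
\sum_{n=1}^\infty P(\Omega^n \mid \Theta^n) \leq \sum_{n=1}^\infty P(\Omega_1^n \mid \Theta^n) + \sum_{n=1}^\infty P(\Omega_2^n \mid \Theta^n). \tag{5}
\]

We consider the two terms individually and show that each of them is finite.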

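To bound the first term on the right-hand side of (5) we use the following
lemma.

Lemma 2. If $\hat{\mathcal{H}}_n \neq \emptyset$, then

\[
\mathcal{E}_\lambda(\hat{h}_{\lambda,n}) - \inf_{h \in \hat{\mathcal{H}}_n} \mathcal{E}_\lambda(h)
\leq 2 \sup_{h \in \hat{\mathcal{H}}_n} |\hat{\mathcal{E}}_\lambda(h) - \mathcal{E}_\lambda(h)|.
\]

Proof. Let $h' \in \hat{\mathcal{H}}_n$ be arbitrary. By the definition of $\hat{h}_{\lambda,n}$,
$\hat{\mathcal{E}}_\lambda(\hat{h}_{\lambda,n}) \leq \hat{\mathcal{E}}_\lambda(h')$. Hence

\begin{align*}
\mathcal{E}_\lambda(\hat{h}_{\lambda,n})
&= \mathcal{E}_\lambda(\hat{h}_{\lambda,n}) - \hat{\mathcal{E}}_\lambda(\hat{h}_{\lambda,n}) + \hat{\mathcal{E}}_\lambda(\hat{h}_{\lambda,n}) - \mathcal{E}_\lambda(h') + \mathcal{E}_\lambda(h') \\
&\leq \mathcal{E}_\lambda(\hat{h}_{\lambda,n}) - \hat{\mathcal{E}}_\lambda(\hat{h}_{\lambda,n}) + \hat{\mathcal{E}}_\lambda(h') - \mathcal{E}_\lambda(h') + \mathcal{E}_\lambda(h') \\
&\leq 2 \sup_{h \in \hat{\mathcal{H}}_n} |\hat{\mathcal{E}}_\lambda(h) - \mathcal{E}_\lambda(h)| + \mathcal{E}_\lambda(h').
\end{align*}

Since $h'$ was arbitrary, the result now follows. □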
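Define the events

\begin{align*}
\Omega_{11}^n &:= \Big\{Z^n : \sup_{h \in \hat{\mathcal{H}}_n} |\hat{\mathcal{R}}_{ND}(h) - \mathcal{R}_{ND}(h)| > \frac{\epsilon}{16}\Big\}, \\
\Omega_{12}^n &:= \Big\{Z^n : \sup_{h \in \hat{\mathcal{H}}_n} |\hat{\mathcal{R}}_D(h) - \mathcal{R}_D(h)| > \frac{\epsilon}{16\lambda}\Big\}, \\
\Omega_{13}^n &:= \Big\{Z^n : \sup_{h \in \hat{\mathcal{H}}_n} |\phi_{ND}(h, \delta_n)| > \frac{\epsilon}{16}\Big\}, \\
\Omega_{14}^n &:= \Big\{Z^n : \sup_{h \in \hat{\mathcal{H}}_n} |\phi_D(h, \delta_n)| > \frac{\epsilon}{16\lambda}\Big\}.
\end{align*}

From Lemma 2 it follows that

\[
\Omega_1^n \subset \bigcup_{j=1}^4 \Omega_{1j}^n,
\]

and hence it suffices to show

\[
\sum_{n=1}^\infty P(\Omega_{1i}^n \mid \Theta^n)
\]

is finite for each $i = 1, 2, 3, 4$. We shall consider $\Omega_{11}^n$ and $\Omega_{13}^n$, the other two cases
following similarly.

For $h \in \hat{\mathcal{H}}_n$ we have $n_{ND}(h, Z^n)/n \geq p_n$ and therefore

\[
\phi_{ND}(h, \delta_n)
= \sqrt{\frac{\log |\mathcal{H}_{k_n}| + \log(2n^2)}{2\, n_{ND}(h, Z^n)}}
\leq \sqrt{\frac{\log |\mathcal{H}_{k_n}| + \log(2n^2)}{2\, n\, p_n}}
\leq \frac{\epsilon}{16}
\]

for $n \geq N_2$, for some $N_2$ sufficiently large. Here we use $\delta_n = 1/n^2$ and
$\log |\mathcal{H}_{k_n}| = o(n / \log n)$. Then

\[
\sum_{n=1}^\infty P(\Omega_{13}^n \mid \Theta^n) \leq N_2.
\]

Furthermore, by the uniform deviation bound,

\[
\sum_{n=1}^\infty P(\Omega_{11}^n \mid \Theta^n) \leq N_2 + \sum_{n > N_2} \frac{1}{n^2} < \infty.
\]

Now consider the event $\Omega_2^n$. Applying Lemma 1 with $\nu = \epsilon/2$, we have that

\[
\sum_{n=1}^\infty P(\Omega_2^n \mid \Theta^n) \leq N_1 + \sum_{n > N_1} \frac{1}{n^2} < \infty.
\]

□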

In the definitions of $\mathcal{R}_D(h)$ and $\mathcal{R}_{ND}(h)$, we define these quantities to be
infinity when the conditioning event has probability zero (see Introduction).
This forces the globally optimal classifier to have both $P(h(X) = 1) > 0$ and
$P(h(X) = 0) > 0$ whenever possible. The same property would hold provided
$\mathcal{R}_D(h) > (1 + \lambda)/\lambda$ when $P(h(X) = 1) = 0$ and $\mathcal{R}_{ND}(h) > (1 + \lambda)$ when
$P(h(X) = 0) = 0$. Were we to define FDR or FNDR to be smaller, our consistency
argument would not apply universally. In particular, it might fail for
distributions where the global minimizer of $\mathcal{E}_\lambda$ has either $P(h(X) = 0) = 0$ or
$P(h(X) = 0) = 1$, such as when $X$ is deterministic. In a preliminary version of
this work, we defined $\mathcal{R}_D(h)$ and $\mathcal{R}_{ND}(h)$ to be zero when the conditioning event
is a null event, and were able to prove consistency under a very mild condition
on the underlying distribution [19].



5. Constraining FDR 

In this section we apply Theorem 1 to analyze a rule that seeks to minimize 
the FNDR subject to the constraint that FDR $\leq \alpha$, where $\alpha$ is a user-defined
significance level. In fact, we first present a more general result, and then deduce 
results for this and other constrained learning problems as corollaries. 

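Thus, let $\mathcal{H}$ be a collection of classifiers as before, but not necessarily finite.
Let $\mathcal{R}_0$ and $\mathcal{R}_1$ be measures of Type I and Type II error. For example, these
may be FDR and FNDR, false positive rate and false negative rate, or some
combination thereof. Assume that for $i = 0, 1$, there exists a data-based estimate
$\hat{\mathcal{R}}_i$ of $\mathcal{R}_i$ and a penalty $\phi_i(h, \delta)$, which define a symmetric confidence interval
for $\mathcal{R}_i$. That is, suppose that for any $0 < \delta < 1$,

\[
P_{Z^n}\Big(\sup_{h \in \mathcal{H}} \big[\, |\hat{\mathcal{R}}_i(h) - \mathcal{R}_i(h)| - \phi_i(h, \delta) \big] > 0\Big) \leq \delta.
\]

For $0 < \alpha < 1$ define

\begin{align*}
h^*_{\mathcal{H},\alpha} = \arg\min_{h \in \mathcal{H}} \ & \mathcal{R}_1(h) \\
\text{s.t.} \ & \mathcal{R}_0(h) \leq \alpha.
\end{align*}

Consider the learning rule

\begin{align*}
\hat{h}_{\mathcal{H},\alpha} = \arg\min_{h \in \mathcal{H}} \ & \hat{\mathcal{R}}_1(h) + \phi_1(h, \delta) \tag{6} \\
\text{s.t.} \ & \hat{\mathcal{R}}_0(h) \leq \alpha + \phi_0(h, \delta).
\end{align*}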

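Theorem 3. The learning rule defined in Eqn. (6) is such that, for any $\delta > 0$
and any $n \geq 1$, with probability at least $1 - 2\delta$ with respect to the draw of the
training data,

\[
\mathcal{R}_1(\hat{h}_{\mathcal{H},\alpha}) \leq \mathcal{R}_1(h^*_{\mathcal{H},\alpha}) + 2 \phi_1(h^*_{\mathcal{H},\alpha}, \delta)
\]

and

\[
\mathcal{R}_0(\hat{h}_{\mathcal{H},\alpha}) \leq \alpha + 2 \phi_0(\hat{h}_{\mathcal{H},\alpha}, \delta).
\]

The result holds regardless of the data-generating distribution.

Proof. Assume that both

\[
|\hat{\mathcal{R}}_0(h) - \mathcal{R}_0(h)| \leq \phi_0(h, \delta) \quad \text{for all } h \in \mathcal{H} \tag{7}
\]

and

\[
|\hat{\mathcal{R}}_1(h) - \mathcal{R}_1(h)| \leq \phi_1(h, \delta) \quad \text{for all } h \in \mathcal{H}, \tag{8}
\]

which, by assumption, occurs with probability at least $1 - 2\delta$. By (7), we deduce
the second half of the theorem from

\[
\mathcal{R}_0(\hat{h}_{\mathcal{H},\alpha}) \leq \hat{\mathcal{R}}_0(\hat{h}_{\mathcal{H},\alpha}) + \phi_0(\hat{h}_{\mathcal{H},\alpha}, \delta) \leq \alpha + 2 \phi_0(\hat{h}_{\mathcal{H},\alpha}, \delta),
\]

where the second inequality follows from $\hat{\mathcal{R}}_0(\hat{h}_{\mathcal{H},\alpha}) \leq \alpha + \phi_0(\hat{h}_{\mathcal{H},\alpha}, \delta)$, which
follows from the definition of $\hat{h}_{\mathcal{H},\alpha}$. To get the first half of the theorem, observe
that $\hat{\mathcal{R}}_0(h^*_{\mathcal{H},\alpha}) \leq \mathcal{R}_0(h^*_{\mathcal{H},\alpha}) + \phi_0(h^*_{\mathcal{H},\alpha}, \delta) \leq \alpha + \phi_0(h^*_{\mathcal{H},\alpha}, \delta)$. Therefore, $h^*_{\mathcal{H},\alpha}$ is
among the candidates in the minimization defining $\hat{h}_{\mathcal{H},\alpha}$. Then, using (8) twice,

\[
\mathcal{R}_1(\hat{h}_{\mathcal{H},\alpha})
\leq \hat{\mathcal{R}}_1(\hat{h}_{\mathcal{H},\alpha}) + \phi_1(\hat{h}_{\mathcal{H},\alpha}, \delta)
\leq \hat{\mathcal{R}}_1(h^*_{\mathcal{H},\alpha}) + \phi_1(h^*_{\mathcal{H},\alpha}, \delta)
\leq \mathcal{R}_1(h^*_{\mathcal{H},\alpha}) + 2 \phi_1(h^*_{\mathcal{H},\alpha}, \delta).
\]

□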

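This theorem can immediately be combined with Theorem 1 to give performance
guarantees for the case $\mathcal{R}_0(h) = \mathcal{R}_D(h)$ and $\mathcal{R}_1(h) = \mathcal{R}_{ND}(h)$, for a
countable class $\mathcal{H}$. In particular, define the rule

\begin{align*}
\hat{h}_{\mathcal{H},\alpha} = \arg\min_{h \in \mathcal{H}} \ & \hat{\mathcal{R}}_{ND}(h) + \phi_{ND}(h, \delta) \tag{9} \\
\text{s.t.} \ & \hat{\mathcal{R}}_D(h) \leq \alpha + \phi_D(h, \delta).
\end{align*}

We have the following.

Corollary 2. Assume $\mathcal{H}$ is countable. For any $\delta > 0$ and any $n \geq 1$, with
probability at least $1 - 2\delta$ with respect to the draw of the training data, the
learning rule in (9) satisfies

\[
\mathcal{R}_{ND}(\hat{h}_{\mathcal{H},\alpha}) \leq \mathcal{R}_{ND}(h^*_{\mathcal{H},\alpha}) + 2 \phi_{ND}(h^*_{\mathcal{H},\alpha}, \delta)
\]

and

\[
\mathcal{R}_D(\hat{h}_{\mathcal{H},\alpha}) \leq \alpha + 2 \phi_D(\hat{h}_{\mathcal{H},\alpha}, \delta).
\]

To extend such a result to a universally consistent estimator, based on the
discussion of Theorem 2, it would be necessary to take $\mathcal{H}$ growing with the
sample size $n$, and to exclude classifiers making too few discoveries or
nondiscoveries. The details are similar to those of Section 4, and a formal development
is omitted.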
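
The constrained rule in (9) can be sketched for a finite list of candidates as follows. As an assumption of this sketch, each candidate is given by its 0/1 predictions on the training sample and a codelength in bits, and candidates with no discoveries or no nondiscoveries are skipped for simplicity. The names `select_constrained` and `penalty` are illustrative.

```python
import math
from typing import Optional, Sequence, Tuple

# (0/1 predictions on the training sample, codelength in bits)
Candidate = Tuple[Sequence[int], float]


def penalty(bits: float, delta: float, count: int) -> float:
    """Hoeffding-type penalty; infinite when the count is zero."""
    if count == 0:
        return math.inf
    return math.sqrt((bits * math.log(2) + math.log(2 / delta)) / (2 * count))


def select_constrained(candidates: Sequence[Candidate],
                       labels: Sequence[int],
                       alpha: float,
                       delta: float) -> Optional[int]:
    """Minimize empirical FNDR + penalty subject to empirical FDR <= alpha + penalty."""
    n = len(labels)
    best_idx, best_val = None, math.inf
    for idx, (preds, bits) in enumerate(candidates):
        n_d = sum(preds)
        n_nd = n - n_d
        if n_d == 0 or n_nd == 0:
            continue  # assumption of this sketch: skip degenerate classifiers
        fdr = sum(1 for p, y in zip(preds, labels) if p == 1 and y == 0) / n_d
        fndr = sum(1 for p, y in zip(preds, labels) if p == 0 and y == 1) / n_nd
        # Relaxed constraint: the penalty widens the feasible set.
        if fdr > alpha + penalty(bits, delta, n_d):
            continue
        val = fndr + penalty(bits, delta, n_nd)
        if val < best_val:
            best_idx, best_val = idx, val
    return best_idx


labels = [1] * 100 + [0] * 100
candidates = [
    ([1] * 130 + [0] * 70, 1.0),   # 30 false discoveries, no false nondiscoveries
    ([1] * 80 + [0] * 120, 1.0),   # no false discoveries, 20 false nondiscoveries
    ([1] * 50 + [0] * 150, 1.0),   # no false discoveries, 50 false nondiscoveries
]
print(select_constrained(candidates, labels, alpha=0.05, delta=0.1))  # 1
print(select_constrained(candidates, labels, alpha=0.5, delta=0.1))   # 0
```

With a strict level the first candidate violates the relaxed FDR constraint and the rule falls back to the feasible candidate with the smallest penalized FNDR; loosening the level makes the first candidate admissible and it is then selected.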

5.1. Neyman-Pearson Classification 

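If we take $\mathcal{R}_0$ and $\mathcal{R}_1$ to be the false positive rate and false negative rate,
respectively, we may apply Theorem 3 to recover and generalize known results
for Neyman-Pearson classification [6, 18]. Specifically, set

\begin{align*}
\mathcal{R}_{FP}(h) &:= P(h(X) = 1 \mid Y = 0), \\
\mathcal{R}_{FN}(h) &:= P(h(X) = 0 \mid Y = 1).
\end{align*}

There are several possible penalties that provide uniform bounds on the deviation
between these quantities and their natural empirical estimates,

\[
\hat{\mathcal{R}}_{FP}(h) := \frac{1}{n_0} \sum_{i : Y_i = 0} \mathbf{1}_{\{h(X_i) = 1\}},
\qquad
\hat{\mathcal{R}}_{FN}(h) := \frac{1}{n_1} \sum_{i : Y_i = 1} \mathbf{1}_{\{h(X_i) = 0\}},
\]

where $n_j := \sum_{i=1}^n \mathbf{1}_{\{Y_i = j\}}$. Examples of such penalties (e.g., VC and Rademacher
penalties) are given in [17]. As a concrete example, we state a result here for
the case of countable $\mathcal{H}$. Thus define the penalties

\[
\phi_{FP}(h, \delta) :=
\begin{cases}
\sqrt{\dfrac{\llbracket h \rrbracket \log 2 + \log(2/\delta)}{2 n_0}}, & n_0 > 0, \\
\infty, & n_0 = 0,
\end{cases}
\]

and

\[
\phi_{FN}(h, \delta) :=
\begin{cases}
\sqrt{\dfrac{\llbracket h \rrbracket \log 2 + \log(2/\delta)}{2 n_1}}, & n_1 > 0, \\
\infty, & n_1 = 0.
\end{cases}
\]

Define the rule

\begin{align*}
\hat{h}_{\mathcal{H},\alpha} = \arg\min_{h \in \mathcal{H}} \ & \hat{\mathcal{R}}_{FN}(h) + \phi_{FN}(h, \delta) \tag{10} \\
\text{s.t.} \ & \hat{\mathcal{R}}_{FP}(h) \leq \alpha + \phi_{FP}(h, \delta).
\end{align*}

We have the following.

Corollary 3. Assume $\mathcal{H}$ is countable. For any $\delta > 0$ and any $n \geq 1$, with
probability at least $1 - 2\delta$ with respect to the draw of the training data, the
learning rule in (10) satisfies

\[
\mathcal{R}_{FN}(\hat{h}_{\mathcal{H},\alpha}) \leq \mathcal{R}_{FN}(h^*_{\mathcal{H},\alpha}) + 2 \phi_{FN}(h^*_{\mathcal{H},\alpha}, \delta)
\]

and

\[
\mathcal{R}_{FP}(\hat{h}_{\mathcal{H},\alpha}) \leq \alpha + 2 \phi_{FP}(\hat{h}_{\mathcal{H},\alpha}, \delta).
\]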

We note that Theorem 3, applied in the context of Neyman-Pearson classification,
is a stronger result than those in [6, 18], which do not explicitly allow
penalties that depend on the classifier $h$.

5.2. Precision and Recall 

As a final application of Theorem 3, we analyze the precision and recall performance
measures, common in database information retrieval problems (see
Introduction). Precision and recall can both be defined in terms of quantities
already discussed. Denote the precision

\[
Q_{PR}(h) := P(Y = 1 \mid h(X) = 1) = 1 - \mathcal{R}_D(h)
\]

and the recall

\[
Q_{RE}(h) := P(h(X) = 1 \mid Y = 1) = 1 - \mathcal{R}_{FN}(h),
\]

and let $\hat{Q}_{PR}(h) := 1 - \hat{\mathcal{R}}_D(h)$ and $\hat{Q}_{RE}(h) := 1 - \hat{\mathcal{R}}_{FN}(h)$ be the empirical estimates.
In this setting the goal is to find the classifier with the largest precision,
while maintaining a recall of at least $\beta$, where $\beta$ is a user-specified level. Thus
the optimal classifier in a given class $\mathcal{H}$ is

\begin{align*}
h^*_{\mathcal{H},\beta} = \arg\max_{h \in \mathcal{H}} \ & Q_{PR}(h) \\
\text{s.t.} \ & Q_{RE}(h) \geq \beta.
\end{align*}

Define the rule

\begin{align*}
\hat{h}_{\mathcal{H},\beta} = \arg\max_{h \in \mathcal{H}} \ & \hat{Q}_{PR}(h) - \phi_D(h, \delta) \tag{11} \\
\text{s.t.} \ & \hat{Q}_{RE}(h) \geq \beta - \phi_{FN}(h, \delta).
\end{align*}

We have the following.

Corollary 4. Assume $\mathcal{H}$ is countable. For any $\delta > 0$ and any $n \geq 1$, with
probability at least $1 - 2\delta$ with respect to the draw of the training data, the
learning rule in (11) satisfies

\[
Q_{PR}(\hat{h}_{\mathcal{H},\beta}) \geq Q_{PR}(h^*_{\mathcal{H},\beta}) - 2 \phi_D(h^*_{\mathcal{H},\beta}, \delta)
\]

and

\[
Q_{RE}(\hat{h}_{\mathcal{H},\beta}) \geq \beta - 2 \phi_{FN}(\hat{h}_{\mathcal{H},\beta}, \delta).
\]

Proof. To apply Theorem 3, note that maximizing $Q_{PR}(h)$ is equivalent to minimizing
$\mathcal{R}_D(h)$, that the constraint $Q_{RE}(h) \geq \beta$ is equivalent to $\mathcal{R}_{FN}(h) \leq
\alpha := 1 - \beta$, and similarly for the empirical objective and constraint. Furthermore,
since $|\hat{Q}_{PR}(h) - Q_{PR}(h)| = |\hat{\mathcal{R}}_D(h) - \mathcal{R}_D(h)|$ and
$|\hat{Q}_{RE}(h) - Q_{RE}(h)| = |\hat{\mathcal{R}}_{FN}(h) - \mathcal{R}_{FN}(h)|$, we have that the assumptions of Theorem 3 are satisfied
with the stated penalties. □

6. Conclusion 

This paper demonstrates that FDR and FNDR control is possible in the context 
of statistical learning theory, where the distribution of {X, Y) is unknown except 
through training data. We develop empirical estimates of these quantities and 
derive uniform deviation bounds which assess the closeness of these empirical 
estimates to the true FDR and FNDR. Unlike most other performance measures 
in statistical learning theory, which are related to binomial random variables, the 
FDR and FNDR measures are related to ratios of binomial random variables, 
which requires the development of novel bounding techniques. These bounds 
are then used to analyze learning rules that minimize a weighted combination
of FDR and FNDR, or that minimize FNDR subject to a constraint on FDR. 
Our strong universal consistency result indicates that it is necessary to prevent 
the learning algorithm from selecting classifiers making too few discoveries or 
nondiscoveries, as error estimates for such classifiers may be poor. 

Extending our results to uncountable classes $\mathcal{H}$ is an interesting open question,
and may require the development of new techniques. The standard proofs of
common generalization error bounds for uncountable classes, such as Rademacher
and VC penalties, rely on the introduction of an artificial "ghost" sample [7].
That technique would require every $h \in \mathcal{H}$ to have the same empirical number
of discoveries (or nondiscoveries) on both the original and ghost samples, which
is generally not the case. Recently El-Yaniv and Pechyony [11] have extended
the ghost sample technique to cases where the training and ghost samples have
different sizes (their results are stated in the context of transductive learning),
and some of their arguments may be useful in this regard.